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Abstract 



We present an algorithm for finding sparse solutions of the system of linear equations <t>x = y 
with rectangular matrices <I> of size n xN, where n < N, when measurement vector y is corrupted 
by a sparse vector of errors e. 

We call our algorithm the £^ -greedy-generous (LGGA) since it combines both greedy and gen- 
erous strategies in decoding. 

^ I Main advantage of LGGA over traditional error correcting methods consists in its ability to 

work efficiently directly on linear data measurements. It uses the natural residual redundancy of the 
^ . measurements and does not require any additional redundant channel encoding. 

We show how to use this algorithm for encoding-decoding multichannel sources. This algo- 
rithm has a significant advantage over existing straightforward decoders when the encoded sources 
O ' have different density/sparsity of the information content. That nice property can be used for very 

^^ ' efficient blockwise encoding of the sets of data with a non-uniform distribution of the information. 

The images are the most typical example of such sources. 

The important feature of LGGA is its separation from the encoder. The decoder does not need 
^ I any additional side information from the encoder except for linear measurements and the knowledge 

H I that those measurements created as a linear combination of different sources. 

1 Introduction 

The problem of finding sparse solutions of a system of linear equations 

a>x = y, x£R^,yeR",N>n, (1) 

is interesting in many Information Theory related contexts. It is tractable as reconstruction of 
a sparse data vector x compressed with the linear operator <J>. While from the point of view of 
classical linear algebra, system ([T} may not have a unique solution, the regularization of the prob- 
lem in the form of the solution sparsity allows to guarantee the uniqueness (or the uniqueness almost 
for sure) in many practically important cases. 

Within this paper, we say that a vector a € W" is sparse if 

k := #{/ I a,- 7^ 0, 1 < / < m} < m, 

i.e., it has some zero entries. This number is called also the Hamming weight of the vector x. 
For a randomly selected matrix <I> and a vector y, if the sparse solution of system (dJ exists, it is 



unique almost for sure. In particular, a sparse, i.e., having at most « — 1 non-zero entries, vector x 
theoretically can be restored from its measurements y almost for sure. 

Unfortunately, a straightforward search for sparse solutions to ([B is an NP-hard problem ( ||T3l ). 
The NP-hardness of the problem does not contradict the existence of the algorithms for finding 
sparse solutions in some quite typical cases. 

The recent Compressed Sensing (Compressive Sampling) studies gave a big push for the devel- 
opment of the theory and affordable algorithms for finding sparse solutions. It turned out that for a 
reasonable (not very large) value of the sparsity the vector x can be recovered precisely, using Linear 
Programming Algorithm (LPA) for finding the solution to (H) with the minimum of ^'-norm (|[T], 
lITl, lUS). While, in practice, the number k characterizing the sparsity of the vectors x which can be 
recovered with £^ -minimization is far from the magic number n— \, this case is a well-studied and 
reliable tool for solving systems ([T} for many applied problems. 

Orthogonal Greedy Algorithm (OGA) is a strong competitor of LPA. If it is implemented ap- 
propriately (im, IIT4I '). it outperforms LPA in both the computational complexity and in the ability 
to recover sparse representations with the greater sparsity number k. In some papers (e.g., |8|), this 
modification of OGA is called Stagewise Orthogonal Matching Pursuit (StOMP = StOGA). 

Very resent success in the efficient recovery of sparse representation is due to paper |[T2| . The 
authors suggested to use a special (band diagonal) type of sensing matrices in combination with a 
belief propagation-based algorithm. While the algorithm [12] gives a very powerful tool in linear 
methods of data compression, it is oriented on a special form of measuring matrix which seems to 
be theoretically less efficient from the point of view of the error correcting capability. 

We do not discuss hear the advantages and disadvantages of different decoding algorithms in 
detail. For the goals of this paper, we are interested in algorithms supporting the opportunity to 
assign non-equal significance of the solution entries. Such algorithms use the idea of reweighted 
/ greedy optimization (see lH, ||5l, ||6l, 191, ifTOl . ifTTI ). In what follows, we use our ^^-greedy 
algorithm (LGA, see Algorithm A below) as a starting point for the algorithms considered in this 
paper. LGA has 2 advantages over competitors. First of all, it was shown numerically in [10| that 
LGA has the highest capacity of the recovery of sparse/compressible data encoded with Gaussian 
matrices. Second, what is more important, it is easily adaptable to the needs of this paper. 

The absence of theoretical justification and relatively high computational complexity are main 
disadvantages of LGA. However, it should be emphasized that the main competitors of LGA also do 
not have appropriate theoretical justification. As for the computational complexity, the fast version 
of LGA was developed in lOTI . Its computational complexity is about the complexity of the regular 
^'-minimization, whereas the reconstruction capacity is very close to regular LGA and significantly 
higher than other, even more computationally extensive, algorithms. 

Above and in what follows, saying that the algorithm has "the higher recovery capability", we 
mean that for the same n and A^ the algorithm is able to recover vectors x with a larger number k of 
non-zero entries. Of course, the relative capability of algorithms depends on the statistical model of 
non-zero entries. We use only the Gaussian model since a Gaussian random value has the highest 
information content among random values with the same variance. The reweighted algorithms are 
known as inefficient for vectors x with th Bernoulli distribution (equally probable ±1 for non-zero 
components). However, we believe that CS for Bernoulli input has the less applied value since 
Bernoulli distribution has a very low information content and the linear methods of compression 
ai^e extremely inefficient for them. At the same time, for distributions having the sparsity (the decay 
rate of non-zero entries) higher than for the Gaussian distribution the capabiUty of greedy-based 
algorithms is higher than for the Gaussian distribution. 

The state-of-the-art approach to data encoding is based on the concept (due to C. Shannon) that 



the optimal data encoding for transmission through a lossy channel can be splitted into 2 indepen- 
dent stages which are source and channel encoding. 

The first stage is source encoding or compression. Usually, compression is a non-linear opera- 
tion. In the ideal case of the optimal compression, its output is a bitstream consisting of absolutely 
random bits with equally probable "0"s and "l"s. The compression reduces the natural redundancy 
of the encoded data. 

The second stage is channel encoding. It plays the opposite role, introducing the artificial 
redundancy. Usually (but not necessary), this is a linear operator on a Galois (say binary) field. The 
redundancy introduced here is different from the redundancy removed on the compression stage. 
Its model is completely known to the receiver of the information (the decoder) and in the case of 
moderate corruption this model can be used for the perfect recovery of the transmitted data. 

While the natural redundancy of the source also can be used for error correction, it is not so 
reliable for this purpose as channel encoding. Nevertheless, in some deeply studied problems a kind 
of "error correction" based on the natural redundancy is possible. Among those applied problems 
we just mention various data denoising methods and digital film restoration. The natural redundancy 
arises when a data model does not allow the digital data representation to take arbitrary values. For 
the denoising the belonging of the data to some class of smoothness serves for protection against 
corruption. Whereas impossibility of big changes between frames following with the rate 20 - 30 
frames per second plays the same role for moving images recovery. In both cases, the redundancy 
model is known only approximately. 

One more problem indirectly conneted to the error correction is image upsampling. The inpaint- 
ing of missing information is tractable as correcting "errors" lost in a channel with erasures. 

In CS community, applications to error correcting codes were noticed practically simultaneously 
with the mainstream compression issue (cf., ||2|, ||3], IJISil ). Repeating commonly accepted channel 
encoding strategy, this approach consists in introducing the redundancy by multiplying the data 
vector y (say, the "compressed" output in ([T])) by a matrix B of the dimension mxn, where m> n. 
We assume that rankB = n. Then the output vector 

z := By = B<t>x 

is protected, at least theoretically, from the corruption of up to m — « — 1 its entries. To understand 
the mechanism of this protection, introduce the matrix C of dimension m x [m — n) whose columns 
constitute a basis of the orthogonal complement of the space spanned by the columns of B. As- 
suming that e is a sparse vector of errors, compute the vector s := C'^(z + e), known in the Coding 
Theory as a syndrome. The syndrome can be measured by the receiver from corrupted information. 
At the same time, C^z = C^By = 0, the corruption vector e can be found as a sparse solution of the 
system 

The scheme considered in the previous paragraph is in the intersection of mainstreams of CS 
and Coding Theory based on separation of the source and channel encoding. 

In this paper, we discuss a different data protection scheme which uses the residual space not 
occupied by the data for error correction purposes. In the literature on Coding Theory, a data trans- 
form providing simultaneous compression and protection from transmission errors is called Joint 
Source-Channel Coding. As we will see below, the plain linear measurements have this property 
provided that the encoded data are sparse enough in some representation system and the vector of 
corruption is also sparse. Therefore, in this case, the compressed data are restorable without any 
additional channel encoding. To our knowledge, the first time, this idea was formulated in |[T6l . 



In this paper, we do not consider any metliods for fighting the corruption with the noise in 
entries when the level of the corruption of the output vector y is relatively low but the corruption 
takes place at each entry. However, we will discuss the stability of our algorithm with respect to 
such corruption. 

We emphasize that the encoding model accepted in this paper is analog encoding, i.e., we do not 
mean to apply analog-to-digital transform to the measurements. The entries are stored (transmitted) 
in the form of their magnitude (not in bitwise representation). For this reason we concider the sparse 
model of corruption when errors are introduced directly into entries of the vector y (not in bits). 

The structure of the paper is as follows. In Section |2] we descuss how corrupted data can 
be recovered with the regular LGA algorithm. In Section |3] we show that the "generous" (LGGA) 
modification of LGA provides much higher error correcting capability provided that the error rate is 
known at least approximately. In Section IH using the well known theoretical equivalency of a lossy 
channel and multisource transmission, we discuss advantages of joint encoding of a few sources of 
data. In Section[5l we design the adaptive £^ -greedy-generous algorithm (ALGGA) which does not 
require preliminary knowledge of either error rate or the relative density of information in multiple 
sources. 

While all results of the paper are based exceptionally on numerical experiments, the stability 
of those results allow to be optimistic about possible applications of algorithms designed on the 
greedy-generous principles introduced below. 

In all our numerical experiments we used random matrices O of size 128 x 256. This size 
seems to be the most popular in academic papers. The selection of other parameters which can be 
found in the text below is associated with that size and its interplay with LGA. They make sense in 
association with the size of input data. While we made experiments with different sizes of <J>, we 
decided to do not include those results because they just confirm the effects and efficiency of the 
algorithms considered in the paper and do not bring new effects deserving extra journal space. 

2 Problem setting and approaches to solving 

Let X e M^ be a sparse vector with k non-zero entries. The vector x is encoded with a linear 
transform <J>x =: y G M". The vector y is corrupted with a vector e G M" with r <n non-zero entries. 
The decoder receives the corrupted measurements 

y = <J>x + e. 

The matrix <I> is also available for the decoder. It is required to find the data vector x. Of course, 
when X is found, the error vector also can be computed. Easy to see that the problem can be rewritten 
in the form 



y = <J>x, were <I> := (<I> /„) g M"^ ^'^+"^ , x = ( 1 , (2) 

/„ is the identity nxn matrix. Assuming that the columns of the matrix <I> are almost orthogonal to 
columns of /„ and taking into account that both x and e are sparse vectors and ^ + r < w, we arrive 
at a standard problem of finding a sparse solution of an underdetermined system of linear equations 
with the matrix <i>. We emphasize that although the columns of /„ and <J> span the same space R" and 
do not introduce separation between data and error spaces, the spans of the columns corresponding 
to the sparse vectors x and e are approximately orthogonal, so the errors can be separated from the 
data. 



Thus, the error correction problem is reduced to the problem which can be solved with any stan- 
dard CS decoder The algorithm we are goint to use for decoding is based on iterative minimization 
of the functional 

l|x||w;.i ■.= Y,Wjj\xi\, (3) 

which is the weighted £'-norm. The initial idea to use the reweighted ^'-minimization for sparse 
solutions of underdetermined system is due to [4|. We will use the following algorithm with the 
higher recovery capability. 

Algorithm A {the f -Greedy Algorithm (LGA)|[lOl) 

1. Initialization. 

(a) Take take wqj = 1, / = 1, . . . ,N. 

(b) Find a solution xq to (dJ providing the minimum for (|3j. 

(c) Set Mo := max{;co^,} and j := 1. 

2. General Step. 

(a) Set Mj := aMy_i and the weights 

e, \xj-ij\>Mj, 
1, \xj-i,i\<Mj. 

(b) Find a solution Xj to ([D providing the minimum for (|3]l. 

(c) Set 7:= 7 + 1. 

(d) If Stopping Criterion is not satisfied, GOTO 2 (a). 

LGA has a minor difference with the standard reweighted £' -minimization from ||4l. This dif- 
ference consists in dynamic updating the weighting function. Nevertheless, on the Gaussian input, 
LGA outperforms both the regular and the reweighted ^'-minimization significantly. 

The results of straightforward application of Algorithm A to the extended inputs y and O from 
(HI) will be shown below. Before we present numerical results, we describe our settings applicable 
to all numerical experiments in this paper We generate vectors x and e with correspondingly k and 
r non-zero entries with normal distribution with parameters (0,1). The entries of the matrix <J> also 
has taken from the standard normal distribution. 

We ran 200 independent trials with Algorithm A or its modifications. In each trial, the result is 
obtained after 30 iterations of the algorithm with parameters a = 0.85, £ = 0.001 or if the club of 
large coefficients, whose magnitude exceeds Mj, reaches the cardinality n. 

On Figure [T] we present the graph of the relative success rate of the recovery for /c-sparse data 
by Algorithm A when k + r = 51. The number 57 corresponds to the LGA reconstruction success 
rate about 0.82 for sensing 57-sparse vectors with random 128 x 384 matrices. The horizontal axis 
reflects the values of k in the range 1 + 57, whereas the number of errors for each point of the graph 
is computed as r = 57 — /:. 

Since we have constant the number 57 = ^ + r of "information" entries, in general, the almost 
constant behavior of the graph on Fig. [T]is predictable, except for the better performance of the 
recovery algorithm when we have less non-zero data entries in x and a greater number of errors. 
Such behavior can be explained by the fact that the errors are associated with the identity part 
of the matrix O, which has orthogonal columns, whereas the columns of the matrix <I> are only 
approximately orthogonal. Therefore errors can be found and corrected easier than the data entries. 
Of course, if the entire extended matrix O were constructed with the Gaussian distribution, the 
graph would be parallel to the A:-axis. 
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Figure 1: Error correction with Algorithm A for 4> E ^128x256 
when the total number of non-zero entries and errors is fixed (=57). 

Thus, Algorithm A works for the error correction even better than it was expected in advance. 

However, there is one very discouraging drawback of such error correction. If we run it on a 57- 
sparse vector and its CS measurements are error free, then the Algorithm A with the extended matrix 
<l> G ]^i28x384 ^jjj recover it with the probability about 0.82, whereas the same LGA algorithm 
applied to the matrix <J> € M^^^^^^se j^ ^^^^ ^^ recover vectors of the same sparsity with probability 
very close to 1. The probability 0.82 on <I> is reached by Algorithm A at the sparsity ^ = 68. Thus, 
we loose about 20% of the efficiency just because of the suspicion that the data could be corrupted. 
The curves of the reconstruction success rates for N = 256 and N = 384 when n = 128 which 
demonstrate those losses are presented on Fig. |2] 



Reconstruction for n=1 28 and N=384 / 256 
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Figure 2: Algorithm A (error free) for $ G 
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From the point of view of the multidimensional geometry everything looks reasonable. We use 
^-dimensional subspaces of W which are all possible spans of ^-subsets of a set of N vectors from 
M". If we increase that set by n vectors, then the number of /c-planes drastically increases from (^) 
to ( ^") . Therefore the ^-planes become harder distinguishable. 

From the point of view of classical coding theory those losses also are explainable. The error 
resilience requires some redundancy in the representation. The higher projected error rate the bigger 
redundancy has to be introduced in advance. If we are lucky and a channel is error free, the redun- 
dancy is already spent (actually, in this case, it is wasted) on the encoding stage. Any improvement 
of the recovery due to the absence of the errors is impossible. 

There is a fundamental difference of our situation. We do not introduce any artificial redun- 
dancy. Therefore, for the lower rate of errors we may hope for the improved performance of the 
decoder on the encoded data recovery. The next section explains how this intuitive advantage can 
be transformed into actual benefits of the decoder. 



3 Algorithm for Variable Error Rate 

Let us come back to Fig. |2] to specify what kind of error correcting algorithm we want to design. 
If we apply Algorithm A to the potentially corrupted vector y obtained as a measurement with the 
extended matrix O, then the success rate will be described by the curve corresponding to the matrix 
128 X (256 + 128) (i.e., N = 384). At the same time, provided that we know that there are no errors 
in the channel, using this piece of information, we just switch Algorithm A to the matrix <I> of size 
128 X 256 (A'^ = 256) and get much higher success rates. Thus, Algorithm A is trivially "adaptable" 
in this way to the case r = 0. 

For r = 1 the situation changes drastically. If we apply Algorithm A to <i>, this curve is just 
a translation of the curve for A/^ = 384 by 1 to the left what is close to the error free case. While 
applying Algorithm A to the matrix <I>, we do not get sparse solutions at all because the columns 
of the identity matrix do not have sparse representations in the dictionary of columns of the matrix 
<I>. In this case, even if we have side information that only one entry of the data are corrupted. 
Algorithm A applied to y and <I> becomes forceless, whereas applying it to <i> we lose in efficiency 
significantly. We wonder whether we can modify Algorithm A in such way that the success rate 
curve becomes the translate of the curve for N = 256 by r = 1 to the left rather than the curve for 
A^ = 384. 

This goal can be reached with Algorithm A if we solve 128 problems with matrices <I>, € 
]^i28x257^ / = 1,2, ... , 128, obtained as an extension of <I> with the /th column of the identity matrix 
and select the sparsest of 128 solutions. Such algorithm is computable but very computationally 
extensive and, because of this complexity, cannot be extended to r even slightly greater than 1 . 

Thus, the main question of this sections is how the information about the density of errors can 
be incorporated in Algorithm A for achieving the maximum possible success rate with minor or no 
change of computational complexity. 

Let us discuss the potential limits of the desired algorithm. First of all, when 

A' n 

i.e., the density of actual errors in e is equal to the density of non-zero entries of x, and non-zero 
entries of x and e have same distributions of their magnitudes, we cannot hope for increasing the 
efficiency, staying within the framework of LGA. Indeed, under those assumptions, we can think 
about the vector x as about a sparse vector with a randomly and independently selected index set 



of k + r non-zero components among N + n entries encoded as y with the matrix <l>. Definitely, 
the point k + r will rather correspond to the point of the curve for N + n {= 384 above) than N (= 
256 above). Thus, we cannot expect any improvement in the recovery of the vector x G M^+" with 
respect to the success rate curve for N + n. Of course, as we noticed above the result can be slightly 
better just because of the structure of <I> involving the identity matrix. 

At the same time, we have a right to expect the improvement when the equality of the propor- 
tions dUl is violated. For the case k/N ^> r/n we may expect the success rate for the recovery k + r 
non-zero entries close to the success curve corresponding to nxN matrix, whereas for k/N <^ r/n, 
we want to see the recovery rate close to the recovery rate for the identity matrix /„. 

At the first glance, the last request may look too challenging because the identity matrix encoder 
y := 4x allows the trivial "recovery" x := y. So the decoder must be extremely efficient. However, 
it is quite realistic. Indeed, when we encode x whith only one non-zero entry (i.e., k = 1) we can 
recover it with a big probability even if e has up to n — 2 non-zero entries. The direct exhaustive 
search for the 1 -sparse solution in this case has the qubic (actually, ~ rP'N) complexity. This com- 
plexity as well as the required precision are high but still realistic. Therefore the high eficiency of 
the projected algorithm for r close to n would not be considered as a miracle. 

Let us formulate our goals in the developing a new recovering algorithm with error correcting 
capability. The algorithm has to provide the property which has meaning of the 3-point interpola- 
tion. It has to work at least as Algorithm A on compressing matrices of size nx [N + n) when dU) 
takes place. Its efficiency has to approach the cases of compressing matrices of size nxN and nxn 
when r — )• and ^ — )• correspondingly. 

The idea of such algorithm relies on the reweighting strategy and some knowledge about an 
approximate error rate. 

Algorithm A is based on the greedy strategy. Its main idea is to provide the entries with the 
bigger expected magnitude with more freedom in the representation of the vector y. This freedom 
is provided by the significantly less weight in the (weighted) ^^-norm given to the entries assumed 
to be large in the "optimal" (sparse) representation of y. Then the algorithm does not worry too 
much about their contribution into the norm. So they can be used efficiently for the partial sum 
minimizing the residual. This is a typical greedy strategy when large coefficients are selected in a 
separated set and on the next iteration the biggest coefficients in the representation of the residual 
extend "the club of large coefficients". 

The greedy approach inspired us on the opposite "generous" strategy. If we know that with 
large probability the channel is (almost) error free, let us give "generously" a larger weight to the 
entire block of entries corresponding to the errors or, vice versa, we set a less weight to the entire 
error block when its density is higher than the density of the data entries. For instance, if we 
want to modify LPA algorithm according to the generous principle formulated above, instead of the 
minimization problem 

||x||i — )• min, subject to Ox = y, 

we solve the problem 

||x||i + A||e||i — ;■ min, subject to Ox = y, 

where A depends on the density of errors. The £^ -greedy algorithm from flO] (and its accelerated 
version from fTTl) reweights input data of its iterations, according to the output of its previous 
iterations. The modification of the problem above does not require any significant changes in Algo- 
rithm A itself. We just introduce a different weight for the error components not included into the 
club of large coefficients. To be more precise we replace the definition of the weight in item 2a of 



Algorithm A with 

{£, \xj^i^i\>Mj; 

1, \xj-\,i\<Mj,i<N; 

A, \xj-u\<Mj,i>N. 

We call this modification Algorithm B or the (.^ -greedy-generous algorithm (LGGA). 

The efficiency of Algorithm B is illustrated on Fig. |3] We found the curves of the success 
rate for r = 0, 1, 5, 15, 45, 90 (the graphs plotted with "o"s from right to left), applying A = 
2.0.2.0, 1.7, 1.5, 0.7, 0.55 in Algorithm B. 

Reconstruction with Error Correction 




Figure 3: Algorithm B for r = 0, 1, 5, 15,45, 90 ("o"curves from right to left), 4> e 
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On the same figure, we plot the graph corresponding to the error free case for N = 384 (the 
graph plotted with "V"s). We see that that the curve of the success rate of Algorithm A for error 
free input practically coincides with the success rate of Algorithm B for the case r = 5. So, under 
similar input conditions. Algorithm B corrects up to 5 errors when Algorithm A admits only error 
free input. 

To discuss the next issue let us move, in the mind's eye, each of graphs on Fig. |3]plotted with 
"o"s to the right by the corresponding value r. Then we have graphs of the recovery k-\-r (data and 
errors) entries. We see that for r equal to 15 and 45 (and definitely for r between them) the shifted 
graphs are very close to the graph for the error free case when N = 384. In this case, generous 
reweighting does not play any significant role and the results for Algorithm A and B are quite close. 
For the case r = 1, the shifted graph is practically coincide with the error free case when N = 256. 
The translate of the curve r = 5 by 5 brings the graph between error free graphs for N equal to 256 
and 384. We note that for the case r = 90 the shifted graph will be far to the right from the curve 
N = 256. This is completely predictable because the identity part of O is not compression at all. 
Therefore, for the matrix /„ separated from <J>, any number of entries are "recoverable" with the 
trivial "algorithm" x := y. 

If we compare the graphs on Fig. |3]with the classical ^^-minimization algorithm, then it turns 
out that the ^^-minimization curve (for n = 128, ^^^^ = 256) practically coincides with the graph 
corresponding to the correction of 15 errors (?» 12% of measurements located at unknown positions 
are corrupted ) on Fig. |3] This observation promotes not only the £' -greedy algorithm which is a 



basic component of Algorithm B but also supports one of the keystone paradigms of CS stating that 
the data measurements stored today can be used much more efficiently in a few years when new 
algorithms will be designed. 

Even staying within the sparse Gaussian model of errors, we have one parameter whose influ- 
ence was not considered yet. This is the magnitude of errors. In classical error correcting codes on 
the binary field, the magnitude is not an issue. In real-valued encoding, the influence of the magni- 
tude of the error vector on the efficiency of the recovery is not so obvious. Our experiments with 
increasing/decreasing the error magnitude in up to 10 times showed that the data reconstruction 
efficiency increases in all cases even if we do not change the generous weight. There is a common 
sense explanation of this phenomenon. The Gaussian distribution has the highest information con- 
tent among distribution with the same variance. The union of the Gaussian data and the Gaussian 
errors with equal variances has the Gaussian distribution while the weighted union is not the Gaus- 
sian one anymore. In fact, it is sparser independently on whether the weight is less or greater than 
1. 



4 Multisource encoding 



In classical Information Theory the model of the channel with errors is considered equivalently as a 
channel with 2 sources of information, i.e., the data transmitted through the channel and undesirable 
errors. From this point of view, the errors consume a part of the channel capacity, reducing the 
capacity of the channel for useful data. This observation leads to the idea to use the error correcting 
scheme considered in Section |3]for encoding/decoding data from a few sources by packing them in 
the same output vector. 

To get some idea how efficient this idea can be we consider a model case of 4 sources. In our 
numerical experiments, each source produces a data vector x,- E M.^. We do not make any special 
assumption about the sparsity of each specific x,. However, we bear in mind that the compound 
vector X G R^^^ has to be somehow sparse to make decoding possible. 

What is especially interesting for us is to outperform Algorithm A when the sources of data 
contain different amount of information, i.e., when the information is distributed non-uniformly 
between blocks. In our settings, the input consists of three blocks with the fixed sparsities 64, 5, 
3 and one block with the variable sparsity K. Thus, the total number of encoded non-zero entries 
is ^ = ^ + 64 + 5 + 3. For Algorithm B, we set the block weights equal to 3.0, 1.0, 3.6, 4.0. We 
applied only "common sense" in setting the weights. No accurate optimization was performed. The 
result of the reconstruction is given on Figj4] 

The numerical experiments show that usage of Algorithm B for decoding data from the mul- 
tisource input may bring significant benefits over the standard uniform decoding strategy. The 
reweighting the blocks completely fits the philosophy of the reweighted optimization. However, the 
new element in the LGGA consists in combine both the greedy and the generous strategies. While 
the greedy strategy sets little penalty weight in the weighted i^-norm for the entries suspicious to be 
used in the encoded data representations, the strategy of Algorithm B quite "generously" changes 
the weights for entire blocks of information. Setting larger weights in £^-norm for the blocks with 
low information contents, we "temporarily exclude" them from consideration on initial iterations 
of Algorithm B, helping blocks with the higher density of the information to be decoded more 
efficiently. 

Now we briefly describe one more possible application, where generous strategy can be useful. 
In applications above we used partitioning the matrix <I> (or O) generated by a specific problem. At 
the same time, in Section[3l we mentioned the phenomenon when the higher or the lower magnitude 
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Figure 4: Algorithm B with decoder weights (3, 1,3.6, 4) vs. Algorithm A for 4 sources with 
ki =K,k2 = 64, k^ = 5, and k4 = 3 entries, $ G ^128x256 _ 

of errors increases the capability of LGGA in error correction. Such phenomenon can be used 
artificially for increasing the algorithm capability in reconstruction of sparse data. We give one 
example showing what benefits are reachable with that approach. 

Let us spht the Gaussian matrix <t> G M"^^ into 2 submatrices <t>i £ R"^'^' and <t>2 £ W^^-, 
N1+N2 = N, and compose a compound sensing matrix *F = (<I>i, 5O2). For decoding data sensed 
with the matrix *F, when n=N\ =N2 = 128 and 5 = 0.1, we introduce little changes in the weight 
selection of Algorithm B. We define the weights as 



w 



j,i 



1, 

A, 



\Xj-l,i 
\Xj-l,i 
\Xj-l,i 
\Xj-lJ 



>Mj, 

>Mj/d, 

<Mj, 
<Mj, 



i<Ni; 
i>Ni; 
i<Nu 
i>Ni. 



The result is presented on Fig. |5] 

This experiment obviously shows that the encoding with compound matrices is a very promising 
tool for efficient encoding purposes. At the same time, it should be taken into account that there is 
a significant stability drawback of that approach. In Section |6l we have more extensive discussion 
about the algorithm stability. However, this is absolutely obvious that the second half of the data 
has the theoretical precision of the recnstruction 1 /5 times lower than the precision of the first half. 
So this trick is kind of exchange of the higher precision for the larger number k. From the point of 
view of settings of formal information theory, this is not progress at all. Anyway, this idea can be 
used when we do not need the same precision for different blocks of information or as a constructive 
brick for other algorithms, where such compound matrices make a sence (e.g., cf. lfT2]l ). 



5 Adaptive ^^-Greedy-Generous Decoding Algorithm 

In Sections [3] and |4] we showed how Algorithm B, applied to linearly compressed data from multiple 
sources, can efficiently recover the encoded information. The tuning of Algorithm B uses the side 
information about the density of the information in each input sources. 
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Embedding this side information in the code may bring additional (probably non-linear) proce- 
dures like protection from the corruption in the channel. While the amount of this side information 
is tiny, this scheme violates the genuine CS architecture which is lineai" by nature. Moreover, in 
most of practical cases, the information density values are not immediately available even for the 
encoder Potentially, it may be extracted by the encoder from the raw data at some computational 
expenses comparable or exceeding CS expenses itself. Say, to get the sparsity information about an 
image the wavelet tiansform has to be applied. In other cases, like for the channel with errors, the 
encoder is not aware about the errors in the channel at all. The error rate of a stationary channel can 
also be measured by sending the empty message to the decoder Thus, to apply Algorithm B we 
need the distribution of information over the blocks, assumed to be obtained from the encoder, and 
information about channel errors, assumed to be measured by the decoder 

Due to reasonings above, we arrive at conclusion that the decoding algorithm having an internal 
estimator of the amount of information encoded in each block is very desirable. Such estimator has 
to use only the raw CS encoder output. The block structure, i.e., the partitioning of the vector x into 
the blocks with potentially different information density, also has to be known for the decoder 

Below, we suggest an adaptive algorithm working on two sources of information. While it is 
quite efficient within the considered settings, it serves as just one successful example on the way to 
a really universal adaptive algorithm acting on the variety of block configurations as well as on the 
different information contents of the blocks. 

We conducted some research bringing the algorithm for adapting weights between blocks, ac- 
cording to intermediate internal estimates made on LGGA iterations. Due to iterative nature of LGA 
we do not try to find the optimal weights for the blocks from the beginning. We rather dynamically 
update those weights in the direction of the higher sparsity of the result. For our estimates, we 

compute 0.5-quasi-norm ||x||o.5 := (L\/l^)^ normahzed by the Euclidean norm ||x||2 := yHxf: 

s{x) := ||x||o.5/||x||2. 

Since the vector x is unknown in the beginning of the decoding, we will use heuristic methods 
measuring the relative behavior of this characteristic for the blocks of the vector x on consecutive 
iterations of LGGA. 



We consider only the simplest case of two blocks of information with potentially different den- 
sities. We assume that the blocks have equal size, i.e., for modeling settings from previous section 
when N = 256, we assume that the block sizes Ni and N2 are equal to 128. Let us agree that 
the index sets are {1, . . . ,N/2} and {N/2 + 1, . . . ,A'^} for the first and for the second blocks cor- 
respondingly. We will use the superscripts 1 and 2 for denoting the corresponding subvectors and 
the subscripts "p" and "c" for the minimum of ^^-norm (pseudoinverse) solution to Ox = y and the 
current estimate of x made by LGGA. 

The next iteration weights will be functions of the value 

S(x2).5(x],)' 

Our heuristic approach is based on the experimental observation that when the second block has 
lower density of the information and the algorithm transforming Xp into x^ has some sparsifying 
properties, then 5 > 1 at least in the sense of the expectation. The value of S grows with the growth 
of the non-uniformity of the information distribution between the blocks. 

Accepting those results, we have to introduce a function setting the block weights in the greedy- 
generous algorithm. We set the weight Wi = 1 for the first block and find the second weight as 

...._. / 0.65^/5 + 0.35^, 5>1; ... 

^^^^)-\ i0.65Vs + 035^)-\ S<1. ^^^ 

Algorithm C {adaptive £^ -Greedy-Generous Algorithm) 

1. Preliminaries 

(a) Compute the minimum £^-norm solution Xp := <I>^(<I><I>^)^'y. 

(b) Compute s(Xp) and s{x^) 

2. Initialization. 

(a) Set wo,,- = 1, / = 1, . . . ,A^, Wi = 1. 

(b) Find the solution xq to ([U providing the minimum for (|3]l. 

(c) Set Mo := max{|xo,,|} and j := 1. 

3. General ALGGA Step. 

(a) Finds and W2 = W2iS) 

(b) Set Mj := aMy_i and the weights 

>Mj, 

<Mj,i<N/2 
<Mj,i>N/2. 

(c) Find the solution x^ to ^ providing the minimum for ^. 

(d) Set J := 7 + 1. 

(e) If Stopping Criterion is not satisfied, GOTO 3 (a). 

The results of our numerical experiment are presented on Fig. [6l For all graphs the horizontal 
coordinate k corresponds to the total number of non-zero entries in both blocks. The graphs plotted 
with "*" correspond to the recovery with Algorithm B when the density of the information is known 
before decoding, whereas the graphs plotted with "o" reflects the results obtained with adaptive 
Algorithm C. There are four pairs of the graphs. 
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Figure 6: Algorithm C ('o', adaptive) vs. Algorithm B ('*', optimal) success rates 
for 2-channel input k = ki + k2, ^2 = 37, 15,5, 1 (from left to right). 



The most left pair corresponds to the uniform infonnation distribution between blocks. 

To be more precise, we set a number of non-zero entries in block 2 as 37 and a number ki of 
non-zero entries in block 1 is changing, giving the total number k = ki + 37. The number 37 is 
one half of the value corresponding to the level 0.5 of the success curve for Algoritm A. Thus, 
in the neighborhood of /c = 74 = 2 x 37 we have non-zero entries uniformly distributed over the 
entire range 1 -^ 256. Of course, in this case, the optimal settings for Algorithm B are VKi = VK2 = 1. 
Therefore Algorithms A and B have identical output. 

Since those 2 most left curves are practically coincide, this means that Algorithm C makes a 
reasonable decision, setting W2 ~ 1 and showing its stability. 

The second pair of curves from the right corresponds the case ^2 = 15. The Algorithm C output 
is very close to the output of Algorithm B (with W2 = 1.7). For the case ^2 = 5 (the third pair of 
curves). Algorithm C gives very big advantage over Algorithm A (the case of uniform distribution), 
however it concedes visually to Algorithm B (with W2 = 2.5). For the case ^2 = 1 (the most right 
pair of graphs). Algorithm C is significantly less efficient than Algorithm B (with W2 = 6.5) but has 
a huge advantage over Algorithm A. 

We would like to emphasize that the function defining the weight W2 in ^ does not pretend 
to be optimal or somehow universal. We just wanted to show that the idea of efficient decoding of 
multichannel code with unknown characteristics of channels may have a quite satisfactory adaptive 
solution. Anyway, we believe that the significant part of Algorithm C losses for highly non-uniform 
distributions is mainly because of statistical inconsistency of the estimate of S rather than imperfec- 
tion of formula (JS). This is obvious that a few non-zero values have insignificant influence on the 
values of sparsity estimates, especially for pseudoinverse solutions s{xp). Therefore the significant 
fluctuations of S may mislead formula dD in setting IV2. 

Experiments with change of the variance of x^ give an argument in favor of our claim above. 
Indeed, we introduce a factors for x^ in the range 0.1 -^ 10 and checked the reconstruction with 
Algoritm C. In all cases, the higher rate of recovery was observed. In particular, for the factor 10, 
the curve for k2 = I became practically coinciding with the Algorithm B curve from Fig. |6] We 
suppose that the reason of this improvement is the increased consistency of the sparsity estimate. 



6 ^^-Greedy-Generous Algorithm Stability. 

The stability of the algorithm with respect to the noise in measurements is a necessary requirement 
for its applicability to the real world problems. The noise may have very different reasons. The 
most typical forms of noise are noise of transmission and noise of quantization. Any noise reduces 
the precision of the result. However, technically, the goal of this section is to estimate the stability 
of the fact of recovery of sparse representations with reasonable precision rather then fighting for 
the best possible precision itself. The CS decoding is a highly non-linear operation. So the stability 
means for CS much more than just the rate of the dependence of ||x — x|| from ||y — y|| because 
even the continuity of such dependence is under question. Generally, no algorithm can be stable for 
all k not exceeding n — \. Indeed, each value of the measurement precision has its own theoretical 
maximum level sparsity k admitting the recovery. In particular, the case k = n — \ requires infinitely 
high precision when N /n > a > 1; « — > oo. For k/n = j3 < 1, the reconstruction with moderate pre- 
cision is possible up to some level of noise in y. For the noise/precision level above that threshold, 
no precision of reconstruction can be guaranteed. The closer j3 to 1 , the lower the threshold is. We 
checked the result of reconstruction for many different levels of noise. The greedy-generous algo- 
rithm has shown the high stability to the noise in the measurements y. Let us describe the settings 
of numeric experiments supporting the stability claim. 

In the noisy environment, we run the same adaptive LGGA with the same parameters but we 
change the criterion of success. The threshold value for "success" was set as 5 = 1 .5 • '2^^\^l^)l^ y/na, 
where 

H{p) := -plog2P - {I - p)log2il - p), 

a is the noice variance. When ||x — x|| < 5, where x is the estimate obtained with the algorithm, we 
say that the reconstruction of x is successful. 

We use an indirect approach for setting the threshold 5. The reasoning is as follows. Due to 
normalization of the matrix <I>, its random n xn submatrix is "almost" unitary. So in the best case 
scenario any deviation from the correct values in the y-domain will be conversed into the same 
deviation in y-domain. Let us obtain a numerical estimate for this conversion. First of all, the noise 
level a at n entries gives the total noise ^/na at y. The level of the noise can be translated into 
a number of significant bits in the entries of y. Unfortunately, we cannot hope that all those bits 
to be used for representation of components of the vector x. Indeed, the linear encoding formula 
([T|l assumes that the data vector x can be recovered from the measurements y. In particular, this 
means that the vector y has to contain not only information about bits in digital representation of 
x but also (maybe implicitely) information about indices of non-zero entries. For encoding that 
information we need H{k/N) bits per entry. The total number of entries in x is N. The information 
about indices has to be put into the vector y consisting of n entries, i.e., each entry of y has to 
"reserve" NH{k/N)/n bits for encoding the indices. Those bits iminently decrease the precision 
of the reconstruction at least in 2^^('^/^)/". Thus, we cannot expect the precision of reconstruction 
||x — x|| higher than 2^^^^l'^'>l"y/n(y. Numerical experiments shows that this value is really very 
close to the precision of the estimate x. We introduce the additional factor 1.5 just to avoid unfair 
"losses" of the success when the precision is slightly higher due to some random fluctuations. 

The graphs of the success rate for non-uniform distribution of the information with a fixed 
number ^2 = 5 (right triplet) and k2 = 37 (right triplet) for a = 0, 0.01, 0.03 are given on Fig.|7] 

The exciting fact that "the best case scenario" estimate turns out to be a good tool for the 
prediction of the deviation of the algorithm output from the true vector is very optimistic for CS 
perspectives. Indeed, the threshold estimate is based on elementary information theory principles 
which cannot be overcome with any algorithm. Compression of information allowing this precision 
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Figure 7: Algorithm C success rate for noise o = 0(o),0.01 (*),0.03(V), 
for k2 = 37 (left) and k2^5 (right). 

of the reconstruction can be achieved if we separately encode the digital (quantized) information 
about the entries and the information about indices. The index part of the information can be 
encoded with the optimal bit budget by the arithmetic encoder whose output is extremely vulnerable 
to the errors. In fact, the results of numerical modeling shows that linear compressor ([T} encodes the 
combination of digital bits and indices in very wise form. The indices are encoded implicitly. They 
are a part of Joint Source-Channel linear encoding. Within a wide range of losses of precision in the 
entry representation, reconstruction does not destroy the structure (index information), providing 
decoding with the quality progressively depending on the channel noise. 

7 Conclusions 



It is shown that Compressed Sensing encoding has very strong internal error correcting capabil- 
ity. No special redundant encoding is required for further error correction. Essentially, the error 
protection capability is defined by the amount of the space underloaded with data by the encoder 
The less the data information content, the higher natural error protection level of the code. Thus, 
Compressed Sensing provides highly efficient Joint Source-Channel Encoding method. 

We showed that a minor modification of the ^^ -greedy decoding algorithm, which we call the 
£' -greedy-generous algorithm (LGGA), allows to correct errors efficiently. 

Using the well-known parallel between combination of data and errors in the channel with a 
few sources joint encoding, we applied the same decoding algorithm to decoding the multi-source 
code. In the case of non-uniform informational contribution of the sources into the encoder output, 
our new £' -greedy-generous algorithm significantly outperforms all known algorithms, including 
£i -greedy, in recovering Gaussian data encoded with Gaussian matrices. 

While the knowledge of approximate distribution of information between blocks is very desir- 
able, this distribution can be estimated dynamically during iterations of LGGA. We suggest algo- 
rithm automatically adapting the "generous" weights of the ^^-greedy-generous algorithm to input 
with unknown (possibly non-uniform) distribution of information between blocks. 
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